Skip to content

<fix>[kvm]: extend reconnect echo timeout after libvirtd restart#3990

Closed
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/haoyu.ding/fix-84691
Closed

<fix>[kvm]: extend reconnect echo timeout after libvirtd restart#3990
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/haoyu.ding/fix-84691

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

Resolves: ZSTAC-84691

Change-Id: I6668736975707575756d7665647867686e6b6b76

sync from gitlab !9883

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

Walkthrough

在 KVMHost 连接/部署流程中加入判断部署是否触发 libvirtd 重启的标记,并在 echo-host 步骤使用基于主机活动 VM 数量计算的动态 echo 超时(或回退到默认超时),同时新增工具常量、统计/计算方法及相应单元和集成测试。

变更

Libvirtd 重启 Echo 超时自适应

Layer / File(s) Summary
超时计算工具与配置常量
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java
新增 TimeUnit 与 VM 状态/VO 导入、三个公开常量(阈值、每 VM 增量秒数、最大秒数),并新增 shouldRestartLibvirtdDuringDeploy()countVmsForLibvirtRestartEchoTimeout()calculateLibvirtRestartEchoTimeoutMillis()
连接/部署流程中的超时计算集成
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
新增私有方法 getLibvirtRestartedEchoTimeout(),在 connectHook 中增加并设置 libvirtRestarted 标志,在 echo-host 步骤计算 echoTimeout 并将其传入 restf.echo(...)(包含重启 kvmagent 后的重试分支)。
单元与集成测试:工具方法与计数验证
test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java, test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy
新增单元测试覆盖 shouldRestartLibvirtdDuringDeploycalculateLibvirtRestartEchoTimeoutMillis 的阈值/追加/封顶/不降配置场景;新增集成测试验证超时计算与 countVmsForLibvirtRestartEchoTimeout 在 VM 状态变更时的计数变化。

Sequence Diagram(s)

sequenceDiagram
  participant KVMHost
  participant AnsibleDeploy
  participant EchoStep
  KVMHost->>KVMHost: 初始化 libvirtRestarted=false
  KVMHost->>AnsibleDeploy: 组装并执行部署(deployArguments)
  AnsibleDeploy-->>KVMHost: 返回 deployArguments 设置
  KVMHost->>KVMHost: 根据 deployArguments 计算 libvirtRestarted
  KVMHost->>EchoStep: 计算 echoTimeout(默认或基于 vmCount 的延长)
  KVMHost->>EchoStep: 调用 restf.echo(..., timeout=echoTimeout)
Loading

预计审查工作量

🎯 3 (Moderate) | ⏱️ ~20 minutes

小兔跳出新妙法,
libvirtd 要重启,
虚机数量来决策,
超时动态伸缩长,
echo 再也不心急! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题完全符合要求的格式 [scope]: ,长度64字符,未超过72字符限制,清晰总结了主要改动:扩展KVM主机重新连接后的echo超时时间。
Description check ✅ Passed 描述与变更集相关,包含JIRA票号ZSTAC-84691以及Change-Id,明确指出来自gitlab同步,与代码变更内容一致。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/haoyu.ding/fix-84691

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.2)
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Comment @coderabbitai help to get the list of available commands and usage tips.

@MatheMatrix
Copy link
Copy Markdown
Owner

Comment from yaohua.wu:

Review: MR !9883 — ZSTAC-84691

变更概述:在 KVM 物理机重连的 echo-host 阶段引入动态 echo 超时。仅当本轮 ansible deploy 实际执行(deployed)且 deploy 参数会触发 libvirtd 重启(restartLibvirtdDuringDeploy)时,按物理机承载 VM 数动态延长超时:VM ≤ 100 用默认 RESTFacade.echoTimeout,每超 1 台 +1s,上限 180s。

结论:APPROVED

修复方向正确、范围收敛、与 Jira 设计方案完全一致,未发现阻塞合并的问题。以下为改进建议。

🟡 Warning

1. 新增的两个纯函数缺少单元测试KVMHostUtils.java:154,158

shouldRestartLibvirtdDuringDeploycalculateLibvirtRestartEchoTimeoutMillis 是纯静态函数,边界逻辑较多但无任何测试覆盖。项目已存在 test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java,建议补充以下边界用例:

  • vmCount = 100 / 101(阈值边界,应分别返回默认值 / 默认值+1s)
  • vmCount 极大时结果封顶 180s
  • REST_FACADE_ECHO_TIMEOUT ≥ 180maxExtraSeconds = 0,动态延长退化为 no-op(应返回配置值本身,不应为负)
  • init / restartLibvirtdnull、大小写混合、"false" 的组合

本 issue 标记为 5.5.22「必须解」,建议补齐测试再合入。

🟢 Suggestion

2. 超时调参常量硬编码KVMHostUtils.java:37-39

LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD(100)、_PER_VM_SECONDS(1)、_MAX_SECONDS(180) 均为编译期常量。默认 RESTFacade.echoTimeout=60s 下,VM 数超过 220 即触顶 180s。对承载数百台 VM 的大物理机,libvirtd 重启耗时可能仍超过 180s,超时问题未必完全消除。建议将上限/阈值暴露为 GlobalProperty,便于现场不打补丁调优。

3. 确认扩展点驱动的 libvirtd 重启路径KVMHost.java:5954 附近

restartLibvirtdDuringDeploy 仅依据 deployArgumentsinit / restartLibvirtd 判断。Jira 评论提到 Hygon mdev、SPICE TLS、libvirt 升级兼容等场景也会重启 libvirtd,这些场景由 KvmHostAgentDeploymentExtensionPoint#modifyDeploymentArguments 处理。restartLibvirtdDuringDeploy 的赋值位于扩展点循环之后——若这些扩展确实通过 setRestartLibvirtd("true") 表达重启意图则已覆盖;若通过其他 ansible 变量触发,则动态超时不会生效(最坏退化为现状)。建议确认这些扩展路径均落到 restartLibvirtd 参数上。

已核对项(无问题)

  • 常规路径行为一致!deployed || !restartLibvirtdDuringDeploy 时返回 toMillis(REST_FACADE_ECHO_TIMEOUT),与旧 restf.echo(path, completion) 内部默认实现 echo(url, cb, toMillis(1), toMillis(REST_FACADE_ECHO_TIMEOUT)) 完全一致,绝大多数重连场景无回归。
  • Flow 时序正确restartLibvirtdDuringDeployapply-ansible-playbookrun() 内、runner.run() 之前赋值;deployed 在 runner 回调内赋值;echo-host 为后续 flow,取值时两者均已确定。deployed=false(runner 判定无需 deploy)时正确退回默认超时。
  • 公式与 Jira 设计一致Math.max(0, MAX - default) 正确处理了用户已配置更大全局超时的情况(不覆盖,符合「不覆盖用户显式配置」的设计意图)。
  • 上游新鲜度5.5.22 自分支点以来未改动 plugin/kvmmerge_status: can_be_merged,无 rebase 冲突风险。
  • KVMHost.java:5320@Transactional(readOnly=true) 与同文件 noStorageAccessible() 等既有 private 方法一致(ZStack AspectJ 织入,对 private/自调用有效),无问题。

🤖 Robot Reviewer

@MatheMatrix MatheMatrix force-pushed the sync/haoyu.ding/fix-84691 branch from 24ef2e4 to 719a2aa Compare May 15, 2026 09:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java (1)

37-39: 🏗️ Heavy lift

建议将超时调参常量改为可配置项

这三个值目前是编译期常量,现场调优需要改代码发版。建议改成 GlobalProperty(保留当前默认值),便于不同规模主机按需调整而不改变默认行为。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java` around lines 37 -
39, Replace the three compile-time constants in KVMHostUtils
(LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD,
LIBVIRT_RESTART_ECHO_TIMEOUT_PER_VM_SECONDS,
LIBVIRT_RESTART_ECHO_TIMEOUT_MAX_SECONDS) with GlobalProperty-backed
configuration entries (keeping the current numeric values as defaults) and
update all usages in KVMHostUtils to read the values via the GlobalProperty
accessor rather than the static constants; name the properties clearly (e.g.,
libvirt.restart.echo.timeout.vmThreshold, .perVmSeconds, .maxSeconds), provide
the existing defaults, add any necessary `@GlobalProperty` or registration so they
are injectable/readable at runtime, and ensure unit tests or callers that
referenced the static fields now obtain the values from the new GlobalProperty
API.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java`:
- Around line 154-156: The boolean-string check in
shouldRestartLibvirtdDuringDeploy can mis-evaluate values with surrounding
whitespace; update the method to trim both parameters (init and restartLibvirtd)
before comparing (e.g., call trim() safely after null-checking each param) and
then use equalsIgnoreCase on the trimmed strings so values like " true " or
"\ntrue\t" are treated as true.

---

Nitpick comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java`:
- Around line 37-39: Replace the three compile-time constants in KVMHostUtils
(LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD,
LIBVIRT_RESTART_ECHO_TIMEOUT_PER_VM_SECONDS,
LIBVIRT_RESTART_ECHO_TIMEOUT_MAX_SECONDS) with GlobalProperty-backed
configuration entries (keeping the current numeric values as defaults) and
update all usages in KVMHostUtils to read the values via the GlobalProperty
accessor rather than the static constants; name the properties clearly (e.g.,
libvirt.restart.echo.timeout.vmThreshold, .perVmSeconds, .maxSeconds), provide
the existing defaults, add any necessary `@GlobalProperty` or registration so they
are injectable/readable at runtime, and ensure unit tests or callers that
referenced the static fields now obtain the values from the new GlobalProperty
API.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 9a9b6257-b724-40cc-ba7a-66ed88f82483

📥 Commits

Reviewing files that changed from the base of the PR and between 24ef2e4 and 719a2aa.

📒 Files selected for processing (3)
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java
  • test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java

Comment on lines +154 to +156
public static boolean shouldRestartLibvirtdDuringDeploy(String init, String restartLibvirtd) {
return "true".equalsIgnoreCase(init) || "true".equalsIgnoreCase(restartLibvirtd);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

布尔字符串判断建议先 trim,避免误判

Line 155 当前仅做 equalsIgnoreCase,若参数值包含前后空格(如 " true ")会返回 false,可能导致本应延长的 echo 超时未生效。

建议修改
 public static boolean shouldRestartLibvirtdDuringDeploy(String init, String restartLibvirtd) {
-    return "true".equalsIgnoreCase(init) || "true".equalsIgnoreCase(restartLibvirtd);
+    String normalizedInit = init == null ? null : init.trim();
+    String normalizedRestartLibvirtd = restartLibvirtd == null ? null : restartLibvirtd.trim();
+    return "true".equalsIgnoreCase(normalizedInit) || "true".equalsIgnoreCase(normalizedRestartLibvirtd);
 }

As per coding guidelines: 注意检查来自 Message 的参数是否做过 trim,用户可能在浏览器上复制粘贴的数据带有空格、换行符等。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java` around lines 154 -
156, The boolean-string check in shouldRestartLibvirtdDuringDeploy can
mis-evaluate values with surrounding whitespace; update the method to trim both
parameters (init and restartLibvirtd) before comparing (e.g., call trim() safely
after null-checking each param) and then use equalsIgnoreCase on the trimmed
strings so values like " true " or "\ntrue\t" are treated as true.


long vmCount = Q.New(VmInstanceVO.class)
.eq(VmInstanceVO_.hostUuid, self.getUuid())
.count();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

!=Stopped?

Assert.assertFalse(KVMHostUtils.shouldForceTlsRedeploy(true, false, false));
}

@Test
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

应该要有个集成测试,这个UT在PR系统里不跑的

@MatheMatrix
Copy link
Copy Markdown
Owner

Comment on plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java:

Comment from yaohua.wu:

在这里加是否发生了deploy以及libvirt重启, libvirtRestarted = deployArguments.getRestartLibvirtd()


@Override
public void run(final FlowTrigger trigger, Map data) {
final long echoTimeout = getKvmAgentEchoTimeoutAfterDeploy(deployed, restartLibvirtdDuringDeploy);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

final long echoTimeout = libvirtRestarted ? getLibvirtRestartedEchoTimeout() : TimeUnit.SECONDS.toMillis(CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT)

Resolves: ZSTAC-84691

Change-Id: I6668736975707575756d7665647867686e6b6b76
@MatheMatrix MatheMatrix force-pushed the sync/haoyu.ding/fix-84691 branch from 719a2aa to e7851c6 Compare May 19, 2026 02:25
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy (2)

68-78: ⚡ Quick win

建议显式断言被测 VM 初始状态前提,提升用例稳定性。

当前断言隐含“该 VM 初始应被计入统计”。建议在更新前增加状态前置断言(例如非 Stopped),避免后续夹具调整引入脆弱失败。

可选修改示例
         VmInstanceState originalState = Q.New(VmInstanceVO.class)
                 .select(VmInstanceVO_.state)
                 .eq(VmInstanceVO_.uuid, vm.uuid)
                 .findValue()
+        assert originalState != VmInstanceState.Stopped
         try {
             SQL.New(VmInstanceVO.class)
                     .eq(VmInstanceVO_.uuid, vm.uuid)
                     .set(VmInstanceVO_.state, VmInstanceState.Stopped)
                     .update()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy`
around lines 68 - 78, 在更新 VM 状态前显式断言被测 VM 的初始状态以避免用例脆弱性:在读取 originalState (通过
Q.New(VmInstanceVO.class).select(VmInstanceVO_.state).eq(VmInstanceVO_.uuid,
vm.uuid).findValue()) 后,添加断言例如 assert originalState != VmInstanceState.Stopped
来保证该 VM 初始被计入统计;同时可在使用
KVMHostUtils.countVmsForLibvirtRestartEchoTimeout(host.uuid) 前断言其等于
originalCount 以明确前置条件,确保后续对 SQL.New(...).set(...,
VmInstanceState.Stopped).update() 的影响可被可靠验证。

46-57: ⚡ Quick win

为修改全局超时值的测试添加互斥保护,避免并发冲突。

代码中多个测试(包括 KVMHostUtilsTest.java 和本文件)都会临时修改 CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT,但未使用任何同步机制。若测试并行执行,将导致修改相互干扰。建议将配置修改、断言验证及恢复操作放在同一互斥区间内(例如 synchronized (CoreGlobalProperty.class))以增强鲁棒性。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy`
around lines 46 - 57, Tests modify CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT
without synchronization, causing race conditions when tests run in parallel;
wrap the block that changes, asserts, and restores REST_FACADE_ECHO_TIMEOUT in a
mutual exclusion on CoreGlobalProperty.class (e.g.,
synchronized(CoreGlobalProperty.class)) so the sequence in
LibvirtRestartEchoTimeoutCase (the try/ finally that sets and restores
REST_FACADE_ECHO_TIMEOUT and calls
KVMHostUtils.calculateLibvirtRestartEchoTimeoutMillis) cannot interleave with
other tests (like KVMHostUtilsTest.java) that also mutate the same static
property.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy`:
- Around line 68-78: 在更新 VM 状态前显式断言被测 VM 的初始状态以避免用例脆弱性:在读取 originalState (通过
Q.New(VmInstanceVO.class).select(VmInstanceVO_.state).eq(VmInstanceVO_.uuid,
vm.uuid).findValue()) 后,添加断言例如 assert originalState != VmInstanceState.Stopped
来保证该 VM 初始被计入统计;同时可在使用
KVMHostUtils.countVmsForLibvirtRestartEchoTimeout(host.uuid) 前断言其等于
originalCount 以明确前置条件,确保后续对 SQL.New(...).set(...,
VmInstanceState.Stopped).update() 的影响可被可靠验证。
- Around line 46-57: Tests modify CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT
without synchronization, causing race conditions when tests run in parallel;
wrap the block that changes, asserts, and restores REST_FACADE_ECHO_TIMEOUT in a
mutual exclusion on CoreGlobalProperty.class (e.g.,
synchronized(CoreGlobalProperty.class)) so the sequence in
LibvirtRestartEchoTimeoutCase (the try/ finally that sets and restores
REST_FACADE_ECHO_TIMEOUT and calls
KVMHostUtils.calculateLibvirtRestartEchoTimeoutMillis) cannot interleave with
other tests (like KVMHostUtilsTest.java) that also mutate the same static
property.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: d4e15424-073c-474a-976e-14a87ab53830

📥 Commits

Reviewing files that changed from the base of the PR and between 719a2aa and e7851c6.

📒 Files selected for processing (4)
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/host/LibvirtRestartEchoTimeoutCase.groovy
  • test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java

@ZStack-Robot ZStack-Robot deleted the sync/haoyu.ding/fix-84691 branch May 19, 2026 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants